{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font size=6.5 color=\"blue\">** Recommendations **</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 寻找相近的用户"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "code_folding": [
     0
    ],
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "critics={\n",
    "    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,'Just My Luck': 3.0, 'Superman Returns': 3.5, \n",
    "                  'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},\n",
    "    'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, \n",
    "                     'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 3.5}, \n",
    "    'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,'Superman Returns': 3.5, 'The Night Listener': 4.0},\n",
    "    'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,'The Night Listener': 4.5, 'Superman Returns': 4.0, 'You, Me and Dupree': 2.5},\n",
    "    'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,\n",
    "                     'You, Me and Dupree': 2.0}, \n",
    "    'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,'The Night Listener': 3.0, 'Superman Returns': 5.0, \n",
    "                      'You, Me and Dupree': 3.5},\n",
    "    'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}} "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2.5"
      ]
     },
     "execution_count": 117,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "critics['Lisa Rose']['Lady in the Water']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 相似度评价值\n",
    "- 欧几里得距离（Euclidean Distance Score）\n",
    "- 皮尔逊相关度（Pearson Correlation Score）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 欧氏距离"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**评价值：**\n",
    "$$\\frac{1}{1+欧氏距离}$$\n",
    "结果在`[0,1]`之间，数值越大，越相关"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from math import sqrt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def sim_distance(prefs,person1,person2):\n",
    "    shared_items={}\n",
    "    for item in prefs[person1]:\n",
    "        if item in prefs[person2]:\n",
    "            shared_items[item]=1\n",
    "        \n",
    "    if len(shared_items)==0: return 0\n",
    "    \n",
    "    sum_of_squares=sum([pow(prefs[person1][item]-prefs[person2][item],2) \n",
    "                        for item in shared_items if item in prefs[person2]])\n",
    "    \n",
    "    return 1/(1+ sqrt(sum_of_squares))      #为保持和书本“一样错”，正确语句：1/(1+ sqrt(sum_of_squares))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.29429805508554946"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sim_distance(critics,'Lisa Rose', 'Gene Seymour')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**两两计算欧几里得距离：**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def calcu(prefs,func=sim_distance):\n",
    "    key_l = list(critics.keys())    \n",
    "    score_d={}\n",
    "    for i in range(0, len(key_l)-1):      #列表index是从“0”开始的\n",
    "        for m in range(i+1, len(key_l)):\n",
    "            score = func(critics,key_l[i],key_l[m])\n",
    "            score_d[key_l[i]+' : '+ key_l[m]]=score      \n",
    "    return score_d"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "score_d=calcu(critics)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 123,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('Gene Seymour : Jack Matthews', 0.8)"
      ]
     },
     "execution_count": 123,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "score_sorted= sorted(score_d.items(),key=lambda item:item[1])  #对字典进行排序，返回d.items()\n",
    "# keys_sorted_by_value = sorted(score_d,key=lambda x:score_d[x])    #按键值排序，返回键的顺序\n",
    "score_sorted[-1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 皮尔逊相关系数\n",
    "参考：\n",
    "1. [如何理解皮尔逊相关系数（Pearson Correlation Coefficient-知乎）？](https://www.zhihu.com/question/19734616)\n",
    "2. [皮尔逊积矩相关系数-WikiPedia](https://zh.wikipedia.org/wiki/%E7%9A%AE%E5%B0%94%E9%80%8A%E7%A7%AF%E7%9F%A9%E7%9B%B8%E5%85%B3%E7%B3%BB%E6%95%B0)\n",
    "$$r=\\frac{\\sum (x - \\overline{x}) (y - \\overline{y}) }{\\sqrt{\\sum (x - \\overline{x})^2 (y - \\overline{y})^2}}$$\n",
    "<font color='red'>分母为0的情况如何剔除？</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from scipy.stats import pearsonr  #return r and p-value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 204,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def sim_pearson(prefs,p1,p2):\n",
    "    shared_items={}\n",
    "    for item in prefs[p1]:\n",
    "        if item in prefs[p2]:\n",
    "            shared_items[item]=1\n",
    "    n=len(shared_items)\n",
    "    if n<=0: return 0\n",
    "    \n",
    "#     x1=[prefs[p1][item] for item in shared_items]\n",
    "#     x2=[prefs[p2][item] for item in shared_items]\n",
    "#     r,p_value= pearsonr(x1,x2)\n",
    "#     if p_value==1: return 0\n",
    "    \n",
    "    sum1=sum([prefs[p1][it] for it in shared_items])\n",
    "    sum2=sum([prefs[p2][it] for it in shared_items])\n",
    "  \n",
    "    # Sums of the squares\n",
    "    sum1Sq=sum([pow(prefs[p1][it],2) for it in shared_items])\n",
    "    sum2Sq=sum([pow(prefs[p2][it],2) for it in shared_items])\n",
    "  \n",
    "    # Sum of the products\n",
    "    pSum=sum([prefs[p1][it]*prefs[p2][it] for it in shared_items])\n",
    "  \n",
    "    # Calculate r (Pearson score)\n",
    "    num=pSum-(sum1*sum2/n)\n",
    "    den=sqrt((sum1Sq-pow(sum1,2)/n)*(sum2Sq-pow(sum2,2)/n))\n",
    "    if den==0: return 0\n",
    "    r=num/den\n",
    "\n",
    "    return r"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 197,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.99124070716193036"
      ]
     },
     "execution_count": 197,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sim_pearson(critics,'Lisa Rose','Toby')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 129,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "score_r=calcu(critics,sim_pearson)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('Michael Phillips : Claudia Puig', 1.0)"
      ]
     },
     "execution_count": 130,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "score_r_sorted= sorted(score_r.items(),key=lambda item:item[1])\n",
    "score_r_sorted[-1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### 其他函数\n",
    "- [Jaccard系数-百度百科](https://baike.baidu.com/item/Jaccard%E7%B3%BB%E6%95%B0/6784913?fr=aladdin)\n",
    "- [曼哈顿距离-百度百科](https://baike.baidu.com/item/%E6%9B%BC%E5%93%88%E9%A1%BF%E8%B7%9D%E7%A6%BB/743092?fr=aladdin)\n",
    "- Tanimoto分值"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 打分计算"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 131,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def topMatches(prefs,person,n=5,similarity=sim_pearson):\n",
    "    scores=[(similarity(prefs,person,other),other) for other in prefs if other!=person]    #字典转化成 元组 列表，便于排序\n",
    "    scores.sort()\n",
    "    scores.reverse()\n",
    "    return scores[0:n]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 132,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0.9912407071619299, 'Lisa Rose'), (0.9244734516419049, 'Mick LaSalle')]"
      ]
     },
     "execution_count": 132,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "topMatches(critics,'Toby',n=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 推荐物品"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 205,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def getRecommendations(prefs,person,similarity=sim_pearson):\n",
    "    totals = {}\n",
    "    simSums= {}\n",
    "    for other in prefs:\n",
    "        if other == person: continue    #跳出本次循环\n",
    "        sim =similarity(prefs,person,other)\n",
    "        if sim<=0: continue\n",
    "        for item in prefs[other]:\n",
    "            if item not in prefs[person] or prefs[person][item]==0:\n",
    "                totals.setdefault(item,0)\n",
    "                totals[item]+=sim*prefs[other][item]\n",
    "                simSums.setdefault(item,0)\n",
    "                simSums[item]+=sim\n",
    "    rankings=[(total/simSums[item], item) for item,total in totals.items()]\n",
    "    \n",
    "    rankings.sort()\n",
    "    rankings.reverse()\n",
    "    return rankings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 147,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(3.3477895267131013, 'The Night Listener'),\n",
       " (2.8325499182641614, 'Lady in the Water'),\n",
       " (2.530980703765565, 'Just My Luck')]"
      ]
     },
     "execution_count": 147,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "getRecommendations(critics,'Toby')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 匹配商品\n",
    "相近商品匹配"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 148,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def transformPrefs(prefs):\n",
    "    result={}\n",
    "    for person in prefs:\n",
    "        for item in prefs[person]:\n",
    "            result.setdefault(item,{})\n",
    "            result[item][person]=prefs[person][item]\n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 149,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0.6579516949597695, 'You, Me and Dupree'),\n",
       " (0.4879500364742689, 'Lady in the Water'),\n",
       " (0.11180339887498941, 'Snakes on a Plane'),\n",
       " (-0.1798471947990544, 'The Night Listener'),\n",
       " (-0.42289003161103106, 'Just My Luck')]"
      ]
     },
     "execution_count": 149,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies = transformPrefs(critics)\n",
    "topMatches(movies,'Superman Returns')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 150,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(4.0, 'Michael Phillips'), (3.0, 'Jack Matthews')]"
      ]
     },
     "execution_count": 150,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "getRecommendations(movies, 'Just My Luck')    #对未观看该电影的人评分进行预测"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# 链接推荐系统\n",
    "基于[del.icio.us](www.del.icio.us)\n",
    "<br>网站服务器切换，API错误，待完成。\n",
    "<br>日期：2018/7/E\n",
    "<br>Page：19"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 基于物品的推荐"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 183,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def calculateSimilarItems(prefs,n=10,p=False):\n",
    "    result={}\n",
    "    itemPrefs=transformPrefs(prefs)\n",
    "    c=0\n",
    "    for item in itemPrefs:\n",
    "        c+=1\n",
    "        if p and c%100==0 : print(\"%d / %d\" %(c,len(itemPrefs))) \n",
    "        scores =topMatches(itemPrefs, item, n=n, similarity=sim_distance)\n",
    "        result[item] =scores\n",
    "    return result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "构件物品相似度数据集 `itemsim`\n",
    "<br>元组组成的列表，相当于一个字典，比字典更“稳定”"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 152,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0.4, 'You, Me and Dupree'), (0.2857142857142857, 'The Night Listener')]"
      ]
     },
     "execution_count": 152,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "itemsim=calculateSimilarItems(critics)    #与书本结果不一致原因：源码错误，欧氏距离计算时未取平方根\n",
    "itemsim['Lady in the Water'][:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 206,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def getRecommendedItems(prefs,itemMatch,user):\n",
    "    userRatings = prefs[user]\n",
    "    scores={}\n",
    "    totalSim={}\n",
    "    for (item, rating) in userRatings.items():      #同时取出元组的两个值，再按任意顺序调用\n",
    "        for (similarity, item2) in itemMatch[item]:\n",
    "            if item2 in userRatings: continue\n",
    "            scores.setdefault(item2,0)\n",
    "            scores[item2]+= similarity*rating\n",
    "            \n",
    "            totalSim.setdefault(item2,0)\n",
    "            totalSim[item2]+= similarity\n",
    "    rankings = [(score/totalSim[item], item) for item, score in scores.items()]\n",
    "    rankings.sort()\n",
    "    rankings.reverse()\n",
    "    return rankings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 154,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(3.1826347305389224, 'The Night Listener'),\n",
       " (2.5983318700614575, 'Just My Luck'),\n",
       " (2.4730878186968837, 'Lady in the Water')]"
      ]
     },
     "execution_count": 154,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "getRecommendedItems(critics, itemsim, 'Toby')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Movie Lens数据集\n",
    "数据集[下载](https://grouplens.org/datasets/movielens/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 155,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def loadMoviLens(path='data/movielens'):\n",
    "    movies={}\n",
    "    for line in open(path + '/u.item', encoding='latin-1'):\n",
    "        (id,title) = line.split('|')[0:2]\n",
    "        movies[id] = title\n",
    "    \n",
    "    prefs={}\n",
    "    for line in open(path + '/u.data',encoding='latin-1'):\n",
    "        (user, movieid, rating, ts)= line.split('\\t')\n",
    "        prefs.setdefault(user,{})\n",
    "        prefs[user][movies[movieid]]=float(rating)\n",
    "    \n",
    "    return prefs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 156,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "prefs=loadMoviLens()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 207,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(5.0, 'They Made Me a Criminal (1939)'),\n",
       " (5.0, 'Star Kid (1997)'),\n",
       " (5.0, 'Santa with Muscles (1996)'),\n",
       " (5.0, 'Saint of Fort Washington, The (1993)'),\n",
       " (5.0, 'Marlene Dietrich: Shadow and Light (1996) ')]"
      ]
     },
     "execution_count": 207,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "getRecommendations(prefs,'87')[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 202,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "itemsim=calculateSimilarItems(prefs,n=50)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 203,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(5.0, \"What's Eating Gilbert Grape (1993)\"),\n",
       " (5.0, 'Vertigo (1958)'),\n",
       " (5.0, 'Usual Suspects, The (1995)'),\n",
       " (5.0, 'Toy Story (1995)'),\n",
       " (5.0, 'Titanic (1997)')]"
      ]
     },
     "execution_count": 203,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "getRecommendedItems(prefs,itemsim,'87')[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 基于用户的推荐"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 208,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def calculateSimilarUsers(prefs,n=10):\n",
    "    result={}\n",
    "    \n",
    "    for item  in prefs:\n",
    "        scores=topMatches(prefs,item,n=n,similarity=sim_pearson)\n",
    "        result[item]=scores\n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 211,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "usersim = calculateSimilarUsers(prefs,n=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 音乐推荐系统\n",
    "基于[Audioscrobbler](http://www.audioscrobbler.net/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  },
  "toc": {
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "toc_cell": false,
   "toc_position": {
    "height": "679px",
    "left": "0px",
    "right": "1366px",
    "top": "107px",
    "width": "234px"
   },
   "toc_section_display": "block",
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
